Abstract. A new high-level interface to multi-threading in Prolog, implemented
in hProlog, is described. Modern CPUs often contain multiple
cores and through high-level multi-threading a programmer can leverage 
this power without having to worry about low-level details. Two com- 
mon types of high-level explicit parallelism are discussed: independent 
and-parallelism and competitive or-parallelism. A new type of explicit 
parallelism, pipeline parallelism, is proposed. This new type can be used 
in certain cases where independent and-parallelism and competitive or- 
parallelism cannot be used. 



1 Introduction 

Modern CPUs often have multiple cores and thus are capable of executing mul- 
tiple threads concurrently. This makes fully exploiting the processing power of 
CPUs a non-trivial problem. There are various Prolog implementations with 
support for multi-threading. Some systems have implemented implicit parallelism
[10,1,2], where general Prolog programs are automatically parallelised by
either concurrently executing multiple goals or concurrently exploring multiple 
branches in the problem tree. However, these implementations turned out to be 
hard to maintain and they also make it hard to control the degree of parallelism. 

Another way to exploit multi-threading is through explicit parallelism. With 
explicit parallelism, the programmer specifies exactly how the program should 
be parallelised. This specification can vary in granularity. For example, some implementations
use a low-level interface based on POSIX threads [14,11,3]. These
implementations allow a great deal of control over the underlying threads. How- 
ever, since these interfaces originate from a procedural API originally written for 
the C programming language, they are not very declarative. Other implemen- 
tations have gone a different route, providing high-level interfaces tailored to 
specific types of problems. It is hard to classify implementations strictly based 
on this distinction, since a number of implementations provide both low-level 
and high-level interfaces. 

Two types of parallelism are competitive or-parallelism [5] and independent
and-parallelism [6,8]. With the former, a disjunction of alternative goals is executed
concurrently by letting each goal compete to provide the solution to a
single problem. The first solution to become available is used and the remaining
goals are stopped. With the latter, each goal in a conjunction of independent 
goals is executed concurrently. This type of parallelism is not suited for conjunc- 
tions where goals depend on each other. 

In this paper, we describe a moderately high-level interface with which one 
can implement high-level problem-specific predicates. Our interface is inspired 
by the interface found in LeanProlog [13] and tries to relieve the programmer
of having to care for low-level details, allowing him to focus on the problems at 
hand. On the other hand, we have tried to make our interface general enough 
to be suitable for a variety of problem types. The interface can be situated 
in between high-level interfaces specific to certain types of problems, and the 
low-level POSIX interface. We have implemented the API in hProlog, a Prolog 
implementation written in C and the successor to dProlog [Sj. The implemen- 
tation required relatively few architectural changes, mainly synchronisation of 
some data areas and making other data areas thread-local. 

We describe our interface and its related concepts in Section 2. In Section 3
and Section 4, we show how our interface can be used to implement competitive
or-parallelism and independent and-parallelism. Section 5 describes a new type
of parallelism which we call pipeline parallelism. In Section 6, results from benchmarks
testing the performance of our implementation with the three discussed
types of parallelism are presented. Finally, in Section 7 we compare our interface
and implementation with other implementations and in Section 8 we formulate
our conclusion.

2 Multi-Threading Support 

The language constructs we describe are centered around threads, identified by 
opaque terms called thread IDs. Each thread concurrently executes a goal in a 
separate Prolog engine. A thread is automatically terminated after generating 
all solutions to its goal. The solutions are made available as soon as they are 
generated. 

2.1 Spawning Threads 

Threads are spawned using the spawn/3 predicate. 
spawn(AnswerPattern, Goal, ID) 



spawn/3 spawns a new thread executing a copy of Goal. On success of the call 
to spawn/3, ID is unified with the thread's ID. AnswerPattern is a term that 
can contain variables from Goal. 

Threads share as many memory areas as possible. The only areas that are
thread-private are the trail, the local stack, the global stack, and the choice point 
stack. All these data areas are expanded as needed. At the lower level, Prolog 
threads are mapped to POSIX threads. Thus, all Prolog threads are scheduled 
by the operating system. 

2.2 Message Passing

Each time a thread generates a solution to its goal, a copy of the term
the(AnswerPattern) is sent to the thread's default recipient, which in general
is the thread calling spawn/3. Using an answer pattern, one can specify
exactly which variables need to be returned to the default recipient, minimising 
the overhead of copying answers. 

After a thread has generated all solutions to its goal, the atom no is sent
to the default recipient to notify it of the thread's termination. Because answers
are wrapped in the/1 and termination messages are not, a recipient can easily 
differentiate between the two. 

Messages can also be sent explicitly:

send(Term)

sends a copy of the(Term) to the default recipient, while

send(ID, Term)

sends the copy to the thread identified by ID. Threads receive messages by calling
receive:

receive(Term)
receive(ID, Term)



receive/1 consumes the first message in the thread's inbox, regardless of the
message's sender. When ID is free, receive/2 unifies it with the ID of the message's
sender. If there are no messages, the call to receive/1 or receive/2 blocks until 
a message becomes available. If ID is a valid thread ID, the first message sent 
by that thread is consumed. Currently, both send and receive fail if the thread 
ID is invalid. Because every message is a copy of a term, further alterations to 
an already sent term do not propagate to its copy. This eliminates the need for 
synchronisation and simplifies the design of the message passing API. 

2.3 Terminating Threads 

A thread can be (preemptively) terminated by issuing a call to 
stop(ID)



Every thread may terminate another thread. When a thread is terminated, the 
threads it has spawned are not terminated as well. 

The virtual machine contains several checkpoints where the cancellation
of a thread is checked. These checkpoints are located in the same places
as where heap overflows are checked. Because cancellation is not immediate, 
a thread may continue executing for a short time until such a checkpoint is 
reached. After reaching a cancellation checkpoint, resources are freed and the 
thread is shut down. Terminating a thread is an asynchronous operation: the
call to stop/1 returns without waiting for the thread to fully shut down. After 
stopping a thread, its ID must not be used anymore. As a temporary measure, 
any messages from the terminated thread remaining in the calling thread's inbox
are purged. What happens to other messages remaining in other threads'
inboxes is currently undefined. 

2.4 Hubs 

Hubs are message queues [12], identified by a hub ID. They are created by calling
hub/1.

hub(HubID)

By specifying the hub's ID, messages can be sent to and received from it.

send(HubID, Term)
receive(HubID, Term)
receive(HubID, ThreadID, Term)

In this case receive/2 receives a message from the hub specified by HubID, 
regardless of the message's sender, while receive/3 receives a message from the 
hub specified by HubID. If ThreadID is a valid thread ID, the first message sent 
by that thread is consumed. Otherwise, if ThreadID is unbound, it is unified 
with the ID of the message's sender. 

Threads can be linked to a hub: when the hub is terminated, all of its linked 
threads are automatically and synchronously terminated. When a thread is 
linked to a hub, the hub becomes the thread's default recipient. As such, the 
solutions the thread generates, as well as its termination message, are sent to 
the hub it is linked to. To spawn a thread linked to a hub 

spawn_link(HubID, AnswerPattern, Goal, ID)



is used, where HubID is the ID of the hub. 

As mentioned, when a hub is stopped using stop/1, all linked threads are 
also stopped. The call to stop/1 is synchronous: it blocks until all the linked 
threads have fully shut down. As such, after stopping a hub, one can be sure 
that the resources of any linked threads have been freed. 

2.5 Limitations 

Currently, there is a limit on the number of threads that can be simultaneously 
executed. If desired, this limit can be specified at run time. There are as many 
thread IDs as the maximum number of simultaneous threads and these IDs are 
recycled after they are discarded. However, a thread's ID cannot be discarded as
long as stop/1 has not been called to signal that the thread will not be used anymore.
As such, when a large number of threads are created but never stopped, one can
run out of thread IDs, even if these threads are not executed simultaneously.
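
The message protocol of Section 2.2 and the ID recycling rule above can be combined in a small helper. The following is a minimal sketch, assuming only spawn/3, receive/2 and stop/1 behave as described in this paper; the names spawn_findall/3 and collect_answers/2 are hypothetical and not part of the interface. It also assumes that calling stop/1 on a thread that has already sent its termination message is the intended way to release its ID.

```prolog
% spawn_findall(+Pattern, +Goal, -Results)
% Hypothetical helper: compute all answers of Goal in a separate thread
% and collect the instantiated copies of Pattern in a list.
spawn_findall(Pattern, Goal, Results) :-
    spawn(Pattern, Goal, ID),
    collect_answers(ID, Results),
    stop(ID).                      % release the thread ID for recycling

% collect_answers(+ID, -Results)
% Answers arrive wrapped in the/1; the bare atom no marks termination.
collect_answers(ID, Results) :-
    receive(ID, Msg),              % blocks until thread ID has sent a message
    (   Msg == no
    ->  Results = []
    ;   Msg = the(Answer),
        Results = [Answer|Rest],
        collect_answers(ID, Rest)
    ).
```

Each message is received into a fresh variable and inspected afterwards, so the sketch does not depend on what receive/2 does when a message fails to unify with its second argument.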

Currently, there is no explicit support for synchronisation between threads. 
However, we expect that implementing synchronisation constructs will not re- 
quire drastic changes. 

The next sections discuss three types of parallelism and show how they can be
implemented using the language constructs described earlier.



3 Competitive Or-Parallelism 

The concept of competitive or-parallelism, as described in [5], "is based on the
interpretation of an explicit disjunction of subgoals as a set of concurrent al- 
ternatives, each running in its own thread". The subgoals, each implementing 
a different algorithm, compete to provide the solution to the problem. When a 
solution is found, the remaining subgoals are stopped. 

Competitive or-parallelism is useful when there exist multiple alternative 
algorithms to solve a single problem. A given algorithm might perform better 
for one instance of the problem, while it might perform worse for another. As 
such, the performance of an algorithm depends on the specifics of the problem. 
Alternatively, an algorithm may never terminate, requiring the use of competitive 
or-parallelism to guarantee that a solution is always found. 

Logtalk [8] provides a threaded/1 predicate for competitive or-parallelism.
It accepts a disjunction of goals and runs these concurrently. The first solution is 
returned to the user by binding the variables in the disjunction. The remaining 
threads are automatically terminated. 

The following example, adapted from [9], solves the water jugs problem using
three competing algorithms: breadth first, depth first and hill climbing. In this
problem, given several jugs of different capacities, we want to measure a certain
amount of water. Jugs can be filled or emptied or their contents can be transferred
to another jug.

solve(Jugs, Moves) :-
    threaded((
        breadth_first_solve(Jugs, Moves)
    ;   depth_first_solve(Jugs, Moves)
    ;   hill_climbing_solve(Jugs, Moves)
    )).

The call to threaded/1 is semideterministic and opaque to cuts; there is no
backtracking over completed calls. One can achieve the same result using the
language constructs described in Section 2.

solve(Jugs, Moves) :-
    hub(H),
    spawn_link(H, Moves, breadth_first_solve(Jugs, Moves), T1),
    spawn_link(H, Moves, depth_first_solve(Jugs, Moves), T2),
    spawn_link(H, Moves, hill_climbing_solve(Jugs, Moves), T3),
    receive(H, the(Moves)),
    stop(H).



Note that we have also implemented threaded/1 using our language constructs.
However, for the sake of brevity, we only show how the same result can be
achieved, without showing the full implementation of threaded/1.
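
The hub-based pattern above generalises to any number of alternatives. The following is a hedged sketch and not the authors' implementation of threaded/1: first_of/2, spawn_goals/4 and await_first/3 are hypothetical names, and the code assumes only hub/1, spawn_link/4, receive/2 and stop/1 as described in Section 2. It counts termination messages so that the call fails, after stopping the hub, when none of the alternatives has a solution.

```prolog
% first_of(+Pattern, +Goals)
% Hypothetical helper: run every goal of the list Goals in its own thread
% and bind Pattern to the first answer that arrives at the hub.
first_of(Pattern, Goals) :-
    hub(H),
    spawn_goals(Goals, H, Pattern, N),
    await_first(N, H, Result),
    stop(H),                       % also stops the threads linked to H
    Result = the(Pattern).         % fails if Result is the atom none

% spawn_goals(+Goals, +Hub, +Pattern, -Count)
spawn_goals([], _, _, 0).
spawn_goals([G|Gs], H, Pattern, N) :-
    spawn_link(H, Pattern, G, _),
    spawn_goals(Gs, H, Pattern, N0),
    N is N0 + 1.

% await_first(+Running, +Hub, -Result)
% Result is the first the/1 message, or none once all Running threads
% have reported termination without producing an answer.
await_first(0, _, none) :- !.
await_first(N, H, Result) :-
    receive(H, Msg),
    (   Msg = the(_)
    ->  Result = Msg
    ;   N1 is N - 1,               % Msg is the termination atom no
        await_first(N1, H, Result)
    ).
```

Under these assumptions, the water jugs example could be written as first_of(Moves, [breadth_first_solve(Jugs, Moves), depth_first_solve(Jugs, Moves), hill_climbing_solve(Jugs, Moves)]).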

4 Independent And-Parallelism 

The concept of independent and-parallelism consists of an explicit conjunction of 
subgoals, each running concurrently. In contrast to competitive or-parallelism, 
the subgoals do not compete to provide the solution to a problem. Instead, 
the conjunction is interpreted as a set of parallelisable goals, all of which need 
to succeed for the whole conjunction to succeed. In order for the goals to be 
parallelisable, goals cannot in general use each other's output as their input. 

The threaded/1 predicate described above also lends itself to independent
and-parallelism. For example, if we want to calculate a given Fibonacci number 
using two threads, we could use 



fibonacci(N, F) :-
    N1 is N-1,
    N2 is N-2,
    threaded((
        do_fibonacci(N1, F1),
        do_fibonacci(N2, F2)
    )),
    F is F1 + F2.





As with competitive or-parallelism, the call to threaded/1 is semideterministic.
The same result can be achieved using our language constructs described in
Section 2.



fibonacci(N, F) :-
    N1 is N-1,
    N2 is N-2,
    spawn(F1, do_fibonacci(N1, F1), T1),
    spawn(F2, do_fibonacci(N2, F2), T2),
    receive(T1, the(F1)),
    receive(T2, the(F2)),
    stop(T1),
    stop(T2),
    F is F1 + F2.

As illustrated, our interface provides everything needed to implement both competitive
or-parallelism and independent and-parallelism.

5 Pipeline Parallelism 

In the previous sections we have described ways to use high-level multi-threading 
constructs for different types of parallelism. However, until now these types were 
restricted to either competitive or-parallelism or independent and-parallelism. 
In this section, we describe a new type of parallelism, one which can be used to 
concurrently execute conjunctions in which goals depend on each other. When 
parallelising such conjunctions, each goal needs to be solved in the order in 
which it appears. Additionally we want the solutions to the conjunction to ap- 
pear in the same order as they would when the set of goals was not executed 
concurrently. Independent and-parallelism is not suited for use with these types 
of conjunctions. 

We propose an approach similar to the instruction pipeline found in modern 
CPUs. As an example, we describe the concept using a conjunction of three 
goals. Figure 1 illustrates this example.

Parallelising the example 3-goal conjunction g1(X,Y), g2(Y,Z), g3(Z,W)
using a pipeline consists of the following stages:

Preparation
    A hub is created to hold the results of the pipeline. Three threads linked
    to the hub are spawned, one for each goal in the conjunction.

First stage
    The first thread starts executing g1(X,Y). It generates a solution with
    accompanying variable bindings [X,Y] and sends those to the second thread.
    After forwarding these bindings, it immediately backtracks to find the next
    solution to its goal, until all solutions have been found.

Second stage
    Meanwhile, the second thread waits for bindings [X,Y] from the first thread.
    As soon as it receives these bindings it executes g2(Y,Z) using them,
    forwarding the resulting bindings [X,Y,Z] each time a solution is found. It does
    this until all bindings from the first thread have been received.

Third stage
    The third thread behaves just like the second, generating all solutions to
    g3(Z,W) for each set of bindings [X,Y,Z] it receives from the second thread.
    The variable bindings [X,Y,Z,W] generated by this thread are sent to the
    hub.
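
To make the data flow between these stages concrete, the following sketch models it sequentially in standard Prolog, without any threads. It is an illustration only: run_stage/4 and run_pipeline/4 are hypothetical names, each stage is modelled as a function from a list of incoming binding lists to a list of outgoing ones, and nothing here runs concurrently. It assumes member/2, findall/3 and copy_term/2 are available.

```prolog
% run_stage(+Vars, +Goal, +Ins, -Outs)
% For every incoming binding list In (in order), instantiate a fresh copy
% of Vars with In and collect the bindings of every solution of Goal.
run_stage(Vars, Goal, Ins, Outs) :-
    findall(V,
            ( member(In, Ins),
              copy_term(Vars-Goal, V-G),
              V = In,
              call(G) ),
            Outs).

% run_pipeline(+Vars, +Conjunction, +Ins, -Outs)
% Feed the output of each stage into the next one, left to right.
run_pipeline(Vars, (Goal, Goals), Ins, Outs) :- !,
    run_stage(Vars, Goal, Ins, Mid),
    run_pipeline(Vars, Goals, Mid, Outs).
run_pipeline(Vars, Goal, Ins, Outs) :-
    run_stage(Vars, Goal, Ins, Outs).
```

For example, with Goals = (member(X, [1,2,3]), member(X, [2,3,4])) and Vars = [X], the query run_pipeline(Vars, Goals, [[_]], Outs) yields the binding lists [2] and [3], in the same order in which ordinary backtracking would produce them.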



[Figure 1 omitted: between pipeline start and pipeline results, thread 1 (start, generate [X,Y], backtrack), thread 2 (wait, generate [X,Y,Z], backtrack) and thread 3 (wait, generate [X,Y,Z,W], backtrack).]

Fig. 1. Pipelined execution of a conjunction of three goals.

5.1 Pipeline Implementation

Here we offer a general implementation of a pipeline using the language constructs
discussed in Section 2 as the predicate piped/2. This predicate can be
used as follows:

?- piped((member(X, [1,2]), member(X, [2,3])), ID),
   stop(ID).
X = 2



As shown, the first and only solution to the given conjunction is generated, after 
which the pipeline is stopped. 

The predicate piped/2 can be implemented as follows: 



piped(Goals, ID) :-
    term_variables(Goals, Vars),
    pipe_create(Goals, Vars, ID),
    pipe_results(ID, Vars).



First, term_variables/2 is called to extract the variables used in the conjunc- 
tion. Since this list of variables is sufficient to fully describe a partial solution, 
its use reduces overhead by eliminating the forwarding of unnecessary terms 
between stages. Next, piped/2 creates a new pipe using pipe_create/3. 



pipe_create(Goals, Vars, End) :-
    hub(End),
    spawn_pipe_stages(End, Vars, Goals, [Head|Stages]),
    link_pipe_stages([Head|Stages], End),
    send(Head, _),
    send(Head, done).



pipe_create/3 creates a new pipe from a conjunction of goals, returning the 
pipe's ID. spawn_pipe_stages/4 spawns a thread for every goal in the conjunc- 
tion. The threads are linked to the hub, which also acts as the end of the pipe. 
The ID of the hub is also used as the ID of the pipeline. 

spawn_pipe_stages(End, Vars, (Goal, Goals), [ID|IDs]) :-
    !, spawn_link(End, [], pipe_stage(Vars, Goal), ID),
    spawn_pipe_stages(End, Vars, Goals, IDs).
spawn_pipe_stages(End, Vars, (Goal), [ID]) :-
    spawn_link(End, [], pipe_stage(Vars, Goal), ID).



In turn, link_pipe_stages/2 links together the stages in the pipe by sending
each thread the ID of the stage following it. 



link_pipe_stages([Stage], End) :-
    !, send(Stage, End).
link_pipe_stages([Stage, Next|IDs], End) :-
    send(Stage, Next),
    link_pipe_stages([Next|IDs], End).



Each thread is started with pipe_stage/2 as its start goal. This predicate han- 
dles the forwarding of (partial) solutions, backtracking and the termination of 
each stage in the pipeline. 



pipe_stage(Vars, Goal) :-
    receive(the(Next)),
    repeat,
    receive(_, the(In)),
    (
        In == done -> send(Next, done), !, fail
    ;   Vars = In
    ),
    Goal,
    send(Next, Vars),
    fail.





This is everything we need to create a pipeline. Note that in pipe_create/3, 
after spawning and linking the stages of the pipeline, a dummy variable and 
the message done are immediately sent to the head of the pipe to make it start 
generating answers. 

Finally, after creating and starting a new pipe, piped/2 calls pipe_results/2. 
This is a backtrackable predicate that unifies Vars with a result from the pipeline 
until no more results are available. 

pipe_results(End, Vars) :-
    repeat,
    receive(End, In),
    (
        In == the(done) -> !, fail
    ;   the(Vars) = In
    ).

Note that piped/2 returns an ID. As previously mentioned, this ID is actually
the ID of the hub to which all stages in the pipeline are linked. As such, calling 
stop/1 on the ID terminates the whole pipeline. 

5.2 A Pipelined Findall 

As an example, we show a version of findall/3 that executes its goals using
pipeline parallelism, while retaining the semantics of findall/3.

piped_findall(Pattern, Goals, Results) :-
    term_variables(Goals, Vars),
    pipe_create(Goals, Pattern+Vars, ID),
    pipe_all_results(ID, Results),
    stop(ID).



The predicate pipe_all_results/2 collects all results from the pipeline, return- 
ing them in a list. 

pipe_all_results(End, Results) :-
    receive(End, In),
    (
        In == the(done) -> Results = []
    ;   In == no -> pipe_all_results(End, Results)
    ;   the(Result+_) = In,
        Results = [Result|Rest],
        pipe_all_results(End, Rest)
    ).



Note that because goals are executed and backtracked in the order in which they 
occur in the conjunction, the results of piped_findall/3 are in the same order
as those of findall/3.

5.3 Identifying Pipeline Parallelisable Problems 

Conjunctions can be sped up using pipeline parallelism because in the pipeline 
backtracking occurs on a goal-per-goal basis. This means that while one goal 
is being backtracked, other goals can simultaneously execute and generate new 
(partial) solutions. Every conjunction can be executed using pipeline parallelism. 
However, this does not mean that every conjunction benefits from it. 

Each stage in a pipeline can be seen as a consumer, accepting partial solutions 
from a previous stage. On the other hand, stages can also be seen as producers, 
using the partial solutions they receive to generate new (partial) solutions. When 
a stage's goal is non-deterministic, we call the stage a non-deterministic producer. 
Non-deterministic producers can produce more than one new partial solution for 
each partial solution that they receive. 

Only pipelines containing one or more non-deterministic producers have the 
potential to speed up execution. In these types of pipelines, a non-deterministic 
producer can produce multiple partial solutions, while subsequent stages are
simultaneously consuming them. If no non-deterministic producer is present in
a pipeline, each stage will only start executing when the previous stage has
terminated. Because of this, a conjunction consisting of fully (semi)deterministic
subgoals cannot be sped up by pipeline parallelism. Note that even though a
conjunction of subgoals may be (semi)deterministic, its subgoals may not be, 
and therefore it is still possible that pipeline parallelism results in a speedup. 
Because concurrency in a pipeline is only possible starting from the first non- 
deterministic producer, it is useless to create a pipeline in which the first goals 
are (semi)deterministic. It is better to exclude these goals from the pipeline. 

Besides taking into account the determinism of subgoals, one also needs to 
take into account the size of the workload represented by a subgoal. Very small 
goals such as X is Y*3 are usually very fast to execute and as such do not 
represent a big workload. Therefore, it can be better to decrease the granularity 
of the parallelisation by grouping these types of subgoals with other subgoals. A 
method for compile-time granularity estimation can be found in ^ . 
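
One way to group a small goal with its neighbour is to fold both into an auxiliary predicate, so that the pipeline sees a single stage. The sketch below is an illustration under assumptions: scaled_check/2 and expensive_check/2 are hypothetical names, and expensive_check/2 is only a cheap stand-in for a goal with a real workload. Note that parentheses alone do not group trailing goals, because ','/2 is right-associative: (A, (B, C)) is the same term as (A, B, C), which spawn_pipe_stages/4 as given in Section 5.1 would still split into three stages.

```prolog
% Hypothetical stand-in for a goal that represents a large workload.
expensive_check(Y, Z) :-
    Z is Y mod 7.

% Fine-grained form, three stages, the middle one being tiny:
%     member(X, Xs), Y is X*3, expensive_check(Y, Z)
%
% Coarser form, two stages: the arithmetic is folded into the stage
% that consumes its result.
scaled_check(X, Z) :-
    Y is X * 3,
    expensive_check(Y, Z).
```

With the predicates of Section 5.2, the coarser conjunction could then be run as piped_findall(Z, (member(X, Xs), scaled_check(X, Z)), Zs), which uses one thread fewer than the fine-grained form.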

6 Performance 

We present results from three different benchmarks, each testing one of the three 
types of parallelism discussed earlier: competitive or-parallelism, independent 
and-parallelism and pipeline parallelism. 

All experiments were run on a machine with two quad-core Intel Xeon E5620
CPUs running at 2.40 GHz, supporting a total of sixteen threads using HyperThreading.
The machine has a total of 24 GB of memory and runs Linux 2.6.32.
Benchmarks were run with a 64-bit version of hProlog and results were compared
with Logtalk version 2.42.4 with a 64-bit SWI-Prolog 5.11.18 backend.

6.1 Independent And-Parallelism 

The Towers of Hanoi problem is easily parallelisable using independent and- 
parallelism. The benchmark solved the problem recursively and we compared 
the running times when using 1, 2, 4, 8 and 16 threads. 

The plot in Figure 2 shows that our implementation scales relatively well
to a high number of threads. When using 2 and 4 threads we achieve a nearly
ideal speedup, halving running times when doubling the number of threads. This
occurs especially with a larger number of rings. However, the efficiency starts to
decrease at 8 threads and with 30 rings the decrease at 16 threads is very drastic.
The behaviour at 16 threads is caused by the fact that there are only 8 physical 
processor cores but 16 hardware threads (provided by HyperThreading). Thus, 
using 16 threads pushes the limits of what these hardware threads can provide. 

As the plot shows, hProlog seems to be more efficient than Logtalk with
SWI-Prolog as its backend, especially at higher thread counts. Note that as a
baseline, for this benchmark hProlog is on average 2 times faster than Logtalk
with SWI-Prolog as its backend. This also explains the speedups at 24 rings in
hProlog: because hProlog is so fast, the relative overhead of multiple threads is
bigger in hProlog than it is in Logtalk, resulting in a lower efficiency.

[Figure 2 omitted: two panels, "Towers of Hanoi (hProlog)" and "Towers of Hanoi (Logtalk)", each plotted against the number of threads.]

Fig. 2. Towers of Hanoi: speedup relative to single-threaded solution in hProlog and
in Logtalk, average over 10 iterations



6.2 Competitive Or-Parallelism 

We used the previously mentioned water jugs problem to benchmark the performance
of competitive or-parallelism. We let a hill-climbing, depth-first and
breadth-first search algorithm compete to provide the solution to the problem. 
To compare the result of competitive or-parallelism with a single algorithm, the 
algorithms were also run separately. The size of the jugs was 5 and 9 liters. 



High-Level Multi-Threading in hProlog 13 



Table 1. Water jugs: slowdown compared to fastest algorithm, average of 25 iterations,
with absolute running time of the fastest algorithm in seconds

Liters  Hill Climbing  Depth First  Breadth First  Competitive  Time (s)
     1           1.38      1571.11           1.00         1.78   0.00252
     2           1.00         1.86           1.19         1.02   2.15184
     3           1.00     14127.60         683.20         1.80   0.00020
     4           1.00     34055.00           2.50         3.00   0.00008
     6           1.62        20.37           1.00         1.60   0.02208
     7           1.00      2270.60       58448.00         2.40   0.00020
     8           1.00      2036.75          70.50         1.75   0.00016
     9           1.00      7805.00           2.00         4.00   0.00004
    11          14.82         1.00           3.97         1.17   0.05180
    12           1.00      3895.36        1752.50         1.29   0.00056
    13           2.41        89.14           1.00         1.16   0.00148
    14           2.14       133.29           1.00         2.14   0.00028



As the results in Table 1 show, the competitive or-parallel running time is
always somewhat slower than the fastest algorithm run separately. However, 
this slowdown is to be expected and can be ascribed to thread scheduling, the 
cost of creating and shutting down threads and the cost of allocating memory 
resources for Prolog engines for each thread. Most of the time, the competitive or- 
parallel approach is less than 2 times slower than the fastest algorithm. However, 
in general, the competitive or-parallel approach is considerably faster than the 
slowest algorithm. 

In the case of 9 liters, the competitive or-parallel approach is 4 times slower 
than the hill climbing algorithm, while the breadth first algorithm is just 2 times
slower. A similar event occurs at 4 and 1 liters. These occurrences are probably 
explained by the overhead of multi-threading and the small running times of the 
fastest algorithm. 

The results for this benchmark are very similar to those obtained by running 
the same benchmark using Logtalk. Logtalk also exhibits similar behaviour at 4 
and 9 liters. 

6.3 Pipeline Parallelism 

Finding the intersection of a number of sets is a problem that is not natural to
parallelise using independent and-parallelism or competitive or-parallelism. However,
using pipeline parallelism, one can achieve considerable speedups, depending on
the number of sets and their size. We compared the performance of the regular
findall/3 predicate to that of the previously described piped_findall/3
predicate.

To calculate the intersection between n sets, we used conjunctions of the
form

member(X, L1), member(X, L2), ..., member(X, Ln)

where L1, L2, ..., Ln are lists of a given length. For example, we used

?- findall(X, (member(X, L1), member(X, L2)), R).
?- piped_findall(X, (member(X, L1), member(X, L2)), R).



to compare the performance of finding the intersection of two sets. We tested the 
performance of both predicates consecutively using 2, 3, 4, . . . , 16 sets. Note that 
to exploit the pipeline to its fullest extent, we have opted not to use memberchk/2 
for the goals following the first one, using member/2 instead.

When all sets are equal, checking every element results in a full traversal of 
the pipeline, maximising its use. We used this scenario as a best-case benchmark. 
When the sets do not share any element, the second member/2 in the conjunction
fails immediately, minimising the use of the pipeline. We used this scenario as a 
worst-case benchmark. 
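
The two scenarios can be reproduced with a small generator. The following is a sketch in standard Prolog, under assumptions: the predicate names are hypothetical, the sets are taken to be lists of consecutive integers, and the paper does not state how its benchmark lists were actually built. member_conjunction/3 builds the conjunction of member/2 goals that is passed to findall/3 or piped_findall/3.

```prolog
% range(+Low, +High, -List): the integers Low..High in ascending order.
range(Low, High, []) :- Low > High, !.
range(Low, High, [Low|T]) :-
    Low1 is Low + 1,
    range(Low1, High, T).

% copies(+N, +X, -Xs): Xs is a list of N occurrences of X.
copies(0, _, []) :- !.
copies(N, X, [X|Xs]) :-
    N > 0,
    N1 is N - 1,
    copies(N1, X, Xs).

% Best case: N equal sets of Size elements each.
best_case_lists(N, Size, Lists) :-
    range(1, Size, L),
    copies(N, L, Lists).

% Worst case: N pairwise disjoint sets of Size elements each.
worst_case_lists(N, Size, Lists) :-
    worst_case_lists(1, N, Size, Lists).

worst_case_lists(K, N, _, []) :- K > N, !.
worst_case_lists(K, N, Size, [L|Ls]) :-
    Low is (K - 1) * Size + 1,
    High is K * Size,
    range(Low, High, L),
    K1 is K + 1,
    worst_case_lists(K1, N, Size, Ls).

% member_conjunction(?X, +Lists, -Conj): member(X,L1), ..., member(X,Ln).
member_conjunction(X, [L], member(X, L)).
member_conjunction(X, [L1, L2|Ls], (member(X, L1), Rest)) :-
    member_conjunction(X, [L2|Ls], Rest).
```

A best-case run over three sets of 2500 elements could then be started with best_case_lists(3, 2500, Ls), member_conjunction(X, Ls, Conj), piped_findall(X, Conj, R), and compared against findall(X, Conj, R).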

The plot for the best-case scenario in Figure 3 shows that considerable speedups
can be achieved by using piped_findall/3. Starting at three or more sets,
piped_findall/3 is consistently faster than findall/3. Overall, speedups increase
as the size of the sets or the number of sets increases. However, note that
in the case of sets of 2500 elements, the size of the speedup peaks at around 14 
threads. Past this point, the overhead incurred by pipelining probably starts to 
exceed the speedup gained from using more threads. 



The plot for the worst-case scenario in Figure 4 shows that there is a significant
slowdown when the main benefit of the pipeline is not used. The smaller
the sets, the bigger the slowdown becomes. Increasing the number of sets also
increases the slowdown. This is because increasing the number of sets extends
the length of the pipeline and thus more threads need to be created. However,
since partial solutions never pass the second thread, all other threads are unused
and only create more overhead.

Fig. 3. Set intersection best case: average speedup compared to regular findall over 25
iterations

Fig. 4. Set intersection worst case: average speedup compared to regular findall over
25 iterations



7 Related Work 

There are a number of Prolog compilers implementing multi-threading support 
on different levels. Among them are Logtalk, SWI-Prolog and LeanProlog, all of
which implement support for multi-threading in different ways. They all support 
independent and-parallelism and competitive or-parallelism either directly or 
indirectly. However, as far as we know, none of them include direct support 
for parallelism through pipelining, although we expect this can be implemented 
using existing primitives. 



7.1 Logtalk 

Logtalk is "an object-oriented programming language that can use most Prolog 
implementations as a back-end compiler". As such, it is not a Prolog compiler in 
the strict sense, but rather a layer on top of an existing compiler. Nevertheless, 
Logtalk provides an interesting comparison since it provides a number of high-level
predicates to leverage multi-threading. One of these predicates was already
mentioned earlier: threaded/1, which can be used for both independent
and-parallelism and competitive or-parallelism.

Logtalk also provides a few high-level predicates to concurrently execute
single goals. These are threaded_call(Goal) and threaded_once(Goal),
which call the given goal in a separate thread. The results from these
goals can be accessed using threaded_exit(Goal) and threaded_peek(Goal).
threaded_exit/1 waits for a result to become available and unifies Goal with
that result. threaded_peek/1 works the same way, but fails if no result is available,
instead of waiting.

Note that threaded_call/1 will generate a single solution to its goal in a 
new thread and then suspend itself. Only after threaded_exit/1 has consumed 
a solution will the next solution be generated. This is different from our approach, 
where a thread is not suspended after it has generated a solution. 

Logtalk also supports one-way asynchronous calls with 
threaded_ignore(Goal), which executes the goal in a thread and then 
succeeds. 

Since Logtalk can use a number of Prolog compilers as its backend, all multi- 
threading predicates in Logtalk are implemented on top of a low-level API, 
specified by the ISO standardisation proposal for multi-threading support in 
Prolog [7]. 

7.2 ISO Standardisation Proposal for Multi-Threading Support in 
Prolog 

This ISO standardisation proposal is based on the design found in SWI-Prolog 
[14]. The predicates have also been implemented in a number of other Prolog 
systems, including XSB [1] and YAP [3]. 

The low-level multi-threading predicates described in the proposal are based 
on the semantics of POSIX threads. The predicates form a comprehensive set, 
supporting the creation and destruction of threads, message queues, and mu- 
texes. The predicates for creating threads accept various options. Some of these 
options allow specifying low-level details of thread creation, such as the limit to 
which the global stack, local stack, C stack or trail can grow. For more details, 
we refer the reader to the proposal itself. 
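
To give an impression of this low-level style, the sketch below uses the thread 
and mutex predicates as they are provided by SWI-Prolog. The predicate names 
bump/2, spawn/3, join/1 and count_in_parallel/3 and the shared counter/1 fact 
are hypothetical and exist only for this example. Thread creation is called 
with an empty option list, so no stack limits are set. 

```prolog
:- dynamic counter/1.            % shared between all threads

% bump(+Mutex, +N): increment the shared counter N times.
% The mutex makes each read-modify-write step atomic.
bump(_, 0) :- !.
bump(Mutex, N) :-
    with_mutex(Mutex,
               ( retract(counter(C0)),
                 C is C0 + 1,
                 assertz(counter(C))
               )),
    N1 is N - 1,
    bump(Mutex, N1).

% spawn(+Mutex, +N, -Id): start one worker thread.
spawn(Mutex, N, Id) :-
    thread_create(bump(Mutex, N), Id, []).

% join(+Id): wait for a worker thread to finish.
join(Id) :-
    thread_join(Id, _Status).

% count_in_parallel(+Threads, +PerThread, -Total):
% let Threads workers each add PerThread to the shared counter.
count_in_parallel(Threads, PerThread, Total) :-
    retractall(counter(_)),
    assertz(counter(0)),
    mutex_create(Mutex),
    length(Ids, Threads),
    maplist(spawn(Mutex, PerThread), Ids),
    maplist(join, Ids),
    mutex_destroy(Mutex),
    counter(Total).
```

The programmer has to create, join and clean up every thread and mutex 
explicitly, which is the kind of bookkeeping a higher-level interface is meant 
to hide. 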

Compared to the predicates in this API, our language constructs are more 
high-level. They do not allow such fine-grained control over limits of memory 
areas. Also, we do not support mutexes (yet). Our goal was to create a high- 
level interface to multi-threading which still allows a relatively high degree of 
control. As such, we intentionally separated the semantics of POSIX threads 
from our interface. 

7.3 SWI-Prolog 

On top of the low-level multi-threading support, SWI-Prolog also provides two 
higher-level predicates for using independent and-parallelism and competitive 
or-parallelism [15]: 



concurrent(N, Goals, Options) 

first_solution(X, Goals, Options) 
concurrent/3 accepts a list of independent goals and executes them concurrently 
using N threads. Contrast this to Logtalk's implementation of threaded/1, 
where the number of threads to use cannot be specified. first_solution/3 calls 
a list of goals and uses the first result that is calculated. X functions as the answer 
pattern, specifying what variables need to be returned. 

The variable Options is a list of options that is passed to the low-level 
predicates that create threads. 
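
A minimal usage sketch of both predicates follows. It assumes a current 
SWI-Prolog, where they are exported by library(thread). The helper predicates 
square/2, slow/1, fast/1, two_squares/2 and fastest/1 are hypothetical and 
serve only as example goals. 

```prolog
:- use_module(library(thread)).   % concurrent/3, first_solution/3

% square(+X, -Y): Y is X squared.
square(X, Y) :- Y is X * X.

% two_squares(-A, -B): independent and-parallelism.
% Both goals are run by a pool of two threads; their bindings are
% passed back to the caller once all goals have completed.
two_squares(A, B) :-
    concurrent(2, [square(3, A), square(4, B)], []).

% Two competing ways to compute an answer; slow/1 takes about a second.
slow(X) :- sleep(1), X = slow.
fast(fast).

% fastest(-X): competitive or-parallelism.
% X is the answer pattern; the first goal to succeed provides its value.
fastest(X) :-
    first_solution(X, [slow(X), fast(X)], []).
```

The empty list passed as Options leaves all thread-creation settings at their 
defaults. 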

7.4 LeanProlog 

Out of all Prolog compilers that we have mentioned, LeanProlog's high- 
level multi-threading support most closely matches ours. Various aspects of our 
interface originated from ideas implemented in LeanProlog and its predecessor 
BinProlog [12]. We have also implemented the multi_fold/3 predicate described 
in [13] using our language constructs. However, for the sake of brevity we have 
not included this implementation here. 

LeanProlog's support is focused on an Interactor API. LeanProlog supports 
what we call threads linked to a hub. By default, threads share the code zone 
but have separate symbol tables. LeanProlog supports symbol garbage collec- 
tion and separate symbol tables allow garbage collection to occur safely in mul- 
tiple threads, without the need for synchronisation. One can optionally specify 
whether the code zone must also be cloned. 

8 Conclusion and Future Work 

We have shown that our interface provides the means to build high-level multi- 
threading constructs and that our implementation's performance is comparable 
to that of Logtalk 2.42.4 on top of SWI-Prolog 5.11.18. We have discussed two 
common types of high-level parallelism and have identified some of their limita- 
tions. Based on these observations, we have proposed a new type of parallelism, 
pipeline parallelism, and we have identified the types of problems where this 
type of parallelism is suitable. 

As for future work, we plan to search for real-world problems for which 
pipeline parallelism is useful. Implicit pipeline parallelism with automatic gran- 
ularity analysis is another topic we would like to research. Furthermore, we would 
like to address some of the limitations of our current design as outlined in 
Section 2.5. We would also like to implement better support for exceptions and 
make better use of them by using them in our language constructs to signal 
errors. 